scHiMe: predicting single-cell DNA methylation levels based on single-cell Hi-C data

Abstract Recently a biochemistry experiment named methyl-3C was developed to simultaneously capture the chromosomal conformations and DNA methylation levels on individual single cells. However, the number of data sets generated from this experiment is still small in the scientific community compared with the greater amount of single-cell Hi-C data generated from separate single cells. Therefore, a computational tool to predict single-cell methylation levels based on single-cell Hi-C data on the same individual cells is needed. We developed a graph transformer named scHiMe to accurately predict the base-pair-specific (bp-specific) methylation levels based on both single-cell Hi-C data and DNA nucleotide sequences. We benchmarked scHiMe for predicting the bp-specific methylation levels on all of the promoters of the human genome, all of the promoter regions together with the corresponding first exon and intron regions, and random regions on the whole genome. Our evaluation showed a high consistency between the predicted and methyl-3C-detected methylation levels. Moreover, the predicted DNA methylation levels resulted in accurate classifications of cells into different cell types, which indicated that our algorithm successfully captured the cell-to-cell variability in the single-cell Hi-C data. scHiMe is freely available at http://dna.cs.miami.edu/scHiMe/.


Data sets
We applied the same quality control as in research (Lee, et al., 2019), based on which we only kept the cells that had total non-clonal reads > 500,000, global mCCC < 3, total autosomal cytosines covered < 100M, and total long-range (>10,000 bp) cis contacts > 5000. Additionally, the total long-range cis contacts of each chromosome need to be larger than x, where x is the length of a chromosome at Mb resolution.

Building meta-cells for single-cell Hi-C data
For each chromosome of every cell, we built a promoter-promoter contact matrix based on the single-cell Hi-C contacts of that cell (not meta-cell). Each promoter region here was defined as containing 50 kb upstream and 50 kb downstream of the TSS plus the TSS of a gene. Notice that a promoter was defined as a 1 kb region upstream of a TSS for methylation meta-cell and prediction (details will be discussed in later sections), and here we made each promoter region larger. This is because in this way more single-cell Hi-C contacts can fall in the promoter regions. In other words, having the regions too small will result in few Hi-C contacts between the regions, which will not adequately define the graph topology of the promoter-promoter interaction networks. A contact was considered to exist between two promoters if one end of a Hi-C read-pair was located inside a promoter region and the other end was located inside another promoter region.
We performed the incremental principal component analysis function (IPCA) using the sklearn package (Pedregosa, et al., 2011)  components. After that, we used these 30 components to compute the Euclidean distance between each pair of cells. Finally, we picked up the 20 nearest neighboring cells and then combined them with the central cell to generate the meta-cell for the target cell. The Hi-C contacts in the meta-cell were used to define edges in the promoter-promoter spatial interaction network.
For the cells used for blind test or benchmarking, we considered the type of each cell unknown information but the number of total cell types in the testing data set known information because most of the biological labs would have known how many and what cell types they have used in their experiments. Because of these, we clustered all the cells in the test dataset and then generated the meta-cell for each cell only using the other cells in the same cluster. We executed scHiCluster (Zhou, et al., 2019) on the intrachromosomal Hi-C contact pairs of all the cells at 1 Mb resolution. The results of scHiCluster are × ( − 1) components, where is the number of cells and − 1 is the number of components for each cell. We used the spectral clustering function from the sklearn package (Pedregosa, et al., 2011) to cluster all the cells by labeling the cells based on the output components of scHiCluster with the arguments of affinity = nearest_neighbors and cluster = 4. We used all of the components generated from scHiCluster to compute the Euclidean distance between all cell pairs.
After that, we selected the 20 nearest neighboring cells in the same cluster of the target cell and then combined the Hi-C contacts of these 20 neighboring cells with Hi-C contacts of the target cell to generate the aggregated Hi-C contacts for the target cell.

Building meta-cells for single-cell DNA methylation data
The single-cell methylation data that we have used provide the number of methylated reads and the total number of reads for individual CpG. We calculated an overall methylated ratio for each promoter of a cell by dividing the sum of the methylated reads (only on CpGs) available for the promoter by the total number of reads (only on CpGs) available for the promoter. Some of the promoters in some of the cells have no methylation data at all, that is, none of the CpGs in the promoters have any reads available or the promoters do not contain any CpG.
For those promoters, we computed the mean overall methylated ratios based on all of the other promoters from the same cell and used that mean value as the overall methylated ratio for each of those promoters that did not have methylation data available.
We calculated a variance value for each promoter based on the overall methylated ratios of the promoter in all cells. After that, we selected 5000 promoters with the highest variance values and performed IPCA on their overall methylated ratios. For each promoter, we obtained the top 30 components, which were used to calculate the Euclidean distance between every cell pair. Based on the Euclidean distances, we chose the 20 nearest neighboring cells and then combined them with the target cell to generate the meta-cell.
The aggregated methylation levels of the meta-cells were used as the true or target methylation levels of the target cell. For example, for base pair 1 in the promoter 1 , five of the 21 cells in the meta-cell have methylation data available, that is, each of the five cells provides both the number of methylated reads on that base pair and the total number of reads on that base pair. The true methylation level or the target value for base pair 1, which was a value in the range of [0, 1], was calculated as the total number of methylated reads from the five cells divided by the total number of reads from the same five cells. If none of the 21 cells in the meta-cell had methylation levels available on a base pair, then we labeled -1 as the target value for the base pair. If a nucleotide base in a promoter was not the cytosine (C) of a CpG, we used -1 as the target value for the base pair, unless the nucleotide base was guanine (G) and its complementary base was the cytosine (C) of a CpG. In that case, we used the methylation level of the complementary base as the target value of the base pair.
In summary, each promoter has 1000 target values, one for each base pair in the promoter region. Notice that the final predictions from scHiMe also contain 1000 values for each promoter (one for each base pair), and the predictions are provided based on the reference genome, which is the DNA sequence of the positive strand. Therefore, the predicted methylation level on a G of a CpG dinucleotide can be considered as the prediction made for its complementary C in the negative strand.

Node and edge features
The first edge feature contains 21 integers, which are the numbers of Hi-C contacts, between the two promoters of the base-10 logarithm of the genomic distance between the two promoters that are connected by the edge. The genomic distance here is at a resolution of 1 bp. In total, the edge feature for each edge consists of 23 values.

Training, validation, and blind test
We used the Adam algorithm (Paszke, et al., 2019) as our optimizer to train the graph transformer. We used the function ReduceLROnPlateau in the torch.optim package (Paszke, et al., 2019) with the parameters mode=min, patience = 5, and factor = 0.1" to automatically reduce learning rates. The learning rate was set to 0.001, and the batch size was set to 1. The training loss was calculated as the average absolute differences between the predicted methylation levels and the target or ground-truth methylation levels. The training loss was calculated on the base pairs that did not have -1 as the target values.
We used the Vip and Sst cell types from the first data set to generate the validation data. There are 2254 graphs for validating the graph-transformer model. We terminated the training process when the validation loss stopped decreasing for five consecutive epochs or the number of epochs reaches 100. The model with the smallest validation loss was selected as the final model and benchmarked on the blind test data sets.
We used Endo, L23, MG, and OPC cell types from the first and second data sets, which corresponded to top 100 cells having the highest number of Hi-C contacts from each of the K562, HAP1, and HeLa cell types to create the extra blind test data.

scHiMe-exon1
We analyzed the lengths of the first exons of all genes in the GENCODE human annotation version 19 dataset and found that 97% of the first exons had lengths within 2000 base pairs to the transcription start site (TSS).
Since the first intron usually exists between the TSS and the first exon, we believe the 2000 bp range should cover almost all of the first introns. Based on this observation, we defined a node as a DNA sequence consisting of 1999 base pairs upstream and 500 base pairs downstream of the transcription start site (TSS), plus the TSS itself. We then used a 6-mer DNABERT to extract node features for each node. For edge features, we used scHiMe-promoter's edge features for scHiMe-exon1. We trained the model of scHiMe-exon1 using the same training parameters as scHiMe-promoter and generated training, validation, and blind test data using the same cell type as scHiMe-promoter.

scHiMe-whole-genome
We first partitioned the entire genome into non-overlapping segments of 1000 base pairs and then for each segment calculated the ratio of base pairs that have true methylation data available. We plotted a histogram, see supplementary Fig. S7, showing the distribution of the numbers of segments in each cell that have this ratio > 0.1. We picked up a representative cell that existed in the middle range of the histogram and then used all of its segments that have > 0.1 ratios to be the template segments, which means for all of the other cells the same segments will be used for generating training, validation, and blind-test datasets, even though their numbers of segments with > 0.1 ratios may be smaller than the representative cell, or the same segments in some other cells may not have the ratio > 0.1. In this way, our data set consists of various types of segments or cells, regardless of their true methylation ratios. We generated node features using a 6-mer DNABERT model and built edge and edge features using the same methods as for scHiMe-promoter. We created training, validation, and blind test datasets using the same cell type as scHiMe-promoter. We trained the scHiMe-whole-genome model using the same training parameters as scHiMe-promoter, except for the learning rate which was set to 0.0001.

Naive predictors
Since there is no other published tool to compare performances, we built two naive predictors. The predictions for a promoter from naive-predictor-1 are the average methylation levels gathered from the radius-one neighboring promoters in the promoter-promoter spatial interaction network. For naive-predictor-2, we defined a target region for each of the base pairs in the target promoter, which contains the target base pair, 50 bp upstream of the target base pair, and 50 bp downstream of the target base pair. All of the single-cell Hi-C contacts with one end falling into the target region were collected. For each of the other ends in these Hi-C contacts, we defined an in-contact region, which contains the in-contact base pair that formed the Hi-C contact with the target base pair, 50 bp upstream of the in-contact base pair, and 50 bp downstream of the in-contact base pair. An average methylation value was calculated based on all of the available ground-truth per-base-pair methylation levels within all of the in-contact regions, which was considered the final prediction from naive-predictor-2 for the target base pair.

Cell classification and its evaluation
Besides clustering cells using all promoters, we also benchmarked the performances of clustering cells using individual promoters or individual CpGs because we believed this provided biologists with more information about which promoters or CpGs and their methylation levels were important in differentiating cells to different cell types. In these cases, the same methodologies for clustering, calculating ARI scores, and visualization were applied with the only difference being that the methodologies were applied on the methylation levels of one or selected promoter(s), cytosine(s), or guanine(s).

Importance of individual CpGs in promoters for clustering cells
We used the predicted methylation levels on all of the CpGs, including both cytosines and guanines, from all of the promoters genome-wide as the features. In other words, each training example is for one cell with the target value as the true cell type for that cell, followed by 1443574 predicted methylation levels. We used all of the 80% of the cells were used for training, and 20% for validation. We tested different values for the number of trees including 100 and 1000, which resulted in classification accuracies of 0.857 and 0.995, respectively. Based on these, the feature importance was generated based on the random forest model that consisted of 1000 trees.
Each of the cytosines and guanines in all of the CpGs had a feature importance value output from the random forest algorithm, which indicated the importance of each cytosine or guanine when used to classify cells into the four cell types.

Collective Influence algorithm
We applied the collective influence (CI) algorithm (Morone and Makse, 2015) on the promoter-promoter spatial interaction networks built on data set 2. A promoter-promoter spatial interaction network was built for each chromosome of a cell, and all of the promoter-promoter interaction networks of a cell were saved in one file and input into the CI algorithm for finding network influencers. The nodes that had no edge connecting to any other nodes were removed. A list of influencers ranked from the most significant to the least significant influencers was outputted from the CI algorithm for each cell.
To find the relationship between the ranking position of each promoter in the influencer list and the ARI score of using the promoter to classify cells, we defined two ways of calculating the average ranking position in the influencer list for each promoter, which are named avg_inf_rank1 and avg_inf_rank2. For each promoter in avg_inf_rank1, if its ranking position for a cell was smaller than , then this original ranking position was used. If its ranking position for a cell was larger than , or if the promoter was not even included in the ranking list for a cell, then + 1 was used as its ranking position for that cell. Two values for threshold were tested, which were 10 and 17822. When using 10, we focused on the cases where a promoter was among the top 10 most significant influencers. When using 17822, we included all possible ranking positions since the longest ranking list contained 17822 influencers. When using the threshold 17822, we only used the cells that had at least 16000 influencers output from the CI algorithm.
For each promoter in avg_inf_rank2, if its ranking position for a cell is smaller than , then this ranking position was kept. We gathered all of the ranking positions for the promoter that meet this criterion and then calculated the average ranking position for that promoter. This process was repeated for all of the promoters.
The thresholds that we tested were 10 and 17822. When using 17822 as the threshold, we only used the cells that had at least 16000 influencers output from the CI algorithm.

Cell classification and its evaluation
Supplementary Fig. S2A shows the ARI-all-data scores of clustering cells based on the predicted methylation levels of all cytosines and guanines of the individual or selected promoters on data set 2. We used the scHiMepredicted methylation levels for each promoter to cluster cells and then ranked the promoters based on the ARIall-data scores. It can be found that using only the first-ranked promoter resulted in an ARI score of 0.664 and using the top 32 promoters brought the ARI score to 0.898. However, using more than the top 50 promoters did

Performance from the perspective of network influencer
We also benchmarked the performance of scHiMe from the perspective of network influencers. Supplementary   Fig. S9 shows the distribution of the number of influencers found for all the cells in data set 2. It can be found that 85% of the cells have from 14007 to 17822 influencers detected from their promoter-promoter spatial interaction networks. We studied the degrees of the top 10 influencers for all cells in data set 2, see the distribution in supplementary Fig. S5 A. We did find higher mean degree values for the first-ranked and secondranked influencers, which are 7.374 and 6.964, respectively. For the influencers ranked from the third to the tenth, we find that they have similar mean degree values.
Supplementary Fig. S5 B shows the evaluation metrics for the influencers ranked from the first to the fourteenth on data set 2. We did not find a trend of increasing or decreasing for all of the evaluation metrics with the increase of the influencer ranking positions.
We further investigated the relationship between influencer significance and the feature importance of the cytosines and guanines in CpGs on data set 2, see supplementary Fig. S10 and S11. Similar to the five evaluation metrics, we did not find an obvious relationship between influencers and feature importance.
We studied whether the most significant influencers resulted in higher ARI-all-data scores when the predicted methylation levels on the influencers were used to cluster cells into different cell types. Supplementary Fig. S5 C shows our benchmark results on all of the cells but only considers the top 10 most significant influencers. Supplementary Fig. S13 shows the results using avg_inf_rank2 on data set 2, which indicates the same conclusions.
Last but not least, we picked up the influencers that were first-ranked in the influencer lists of all cells and counted the number of occurrences of each promoter as the first-ranked influencer among all cells. The promoters that have the same number of occurrences were used to cluster cells and the ARI-all-data scores were calculated. Besides the first-ranked influencers, we also conducted the same experiments for the second-ranked, third-ranked, fourth-ranked, and fifth-ranked influencers, see Supplementary Fig. S14 A-E. No significant Pearson correlation or relationship was found between the ARI-all-data scores and the number of occurrences.
We also picked up the top five most occurred influencers for influencer ranking positions 1 to 93 (the shortest influencer list contained 93 influencers) and used the top five influencers to clustering cells, see Supplementary   Fig. S14 F, from which the same conclusion was drawn. Figure S1. A-B. The absolute differences between the naive predictor 1-predicted and true methylation levels on the blind-test cells in data sets 1 and 2. C-D. Similar to A-B, but on naive predictor 2-predicted methylation levels.              Table S7. The top 20 promoters' gene names with the highest ARI-all-data scores based on scHiMe-promoter predicted methylation levels in the data sets 1 and 2.